Organization Background:
ABC Technologies is facing a concerning trend of high employee attrition, even though it offers a competitive salary and benefits package.
Problem Statement:
This suggests the root cause of attrition lies beyond financial compensation. The company wants to identify the underlying factors driving employee departures in order to improve retention and maintain a strong talent pool.
Challenges Faced:
Analyze & Visualize the employee data to uncover trends and patterns related to employee attrition
Identify key attributes associated with departing employees
Build a classification model to predict whether an employee is at risk of attrition
Analyze model performance to identify the most significant predictors of attrition
# Importing sys and os
# OS module is responsible for interacting with the operating system, providing access to the underlying interface of the operating system
# SYS module is responsible for the interaction between the program and the Python interpreter
import sys
import os
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
#pd.options.display.max_columns = None
# The warnings filter controls whether warnings are ignored, displayed, or turned into errors (raising an exception)
# 'ignore' = never print matching warnings
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
pd.set_option('display.max_columns', 50)
import numpy as np
import re
from scipy import stats
import seaborn as sns
sns.set(rc = {'figure.figsize' : (14, 10)})
sns.set_style('whitegrid')
sns.set_palette('dark')  # color_palette() only returns a palette; set_palette() actually applies it
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from matplotlib import rcParams
rcParams['lines.linewidth'] = 2
rcParams['xtick.labelsize'] = 9
rcParams['ytick.labelsize'] = 9
rcParams['axes.labelsize'] = 9
rcParams['axes.titlesize'] = 10
rcParams['axes.titlepad'] = 10.0
rcParams['figure.dpi'] = 300
rcParams['savefig.dpi'] = 300
import sklearn as sk
from sklearn.model_selection import train_test_split #, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
# from sklearn.linear_model import LogisticRegression
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.svm import SVC
# from sklearn.ensemble import RandomForestClassifier
# import xgboost as xgb
# from xgboost import XGBClassifier
# import imblearn as im
# from imblearn import under_sampling, over_sampling
# set custom font
font_path = '/content/drive/MyDrive/Colab Notebooks/resources/BentonSans-Regular.ttf'
import matplotlib.font_manager as fm
fm.fontManager.addfont(font_path)
plt.rc('font', family='BentonSans')
plt.figure(facecolor='white')
#check the version
print('python', sys.version)
for pkg in [pd, np, mpl, sns, sk]:  # , im, xgb
    print(pkg.__name__, pkg.__version__)
Mounted at /content/drive
python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
pandas 2.0.3
numpy 1.25.2
matplotlib 3.7.1
seaborn 0.13.1
sklearn 1.2.2
The dataset is taken from Kaggle and contains HR analytics data on employees who stayed and employees who left.
The types of data include metrics such as Education, Environment Satisfaction, Job Involvement, Job Satisfaction, Performance Rating, Relationship Satisfaction, Work-Life Balance, Commute Distance, etc.
Asset:Employee Attrition Data
License: Open Database License / Database Content License
Source Link: Kaggle
Reading The File
gd_path = '/content/drive/MyDrive/Colab Notebooks'
fname = 'dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv'
data = f"{gd_path}/{fname}"
df = pd.read_csv(data)
Taking a glance at the data
# Take a glance at first 10 rows for quickly testing if our object has the right type of data in it
df.head(10)
| Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | MonthlyRate | NumCompaniesWorked | Over18 | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | 2 | Female | 94 | 3 | 2 | Sales Executive | 4 | Single | 5993 | 19479 | 8 | Y | Yes | 11 | 3 | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | 3 | Male | 61 | 2 | 2 | Research Scientist | 2 | Married | 5130 | 24907 | 1 | Y | No | 23 | 4 | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | 4 | Male | 92 | 2 | 1 | Laboratory Technician | 3 | Single | 2090 | 2396 | 6 | Y | Yes | 15 | 3 | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | 4 | Female | 56 | 3 | 1 | Research Scientist | 3 | Married | 2909 | 23159 | 1 | Y | Yes | 11 | 3 | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | 1 | Male | 40 | 3 | 1 | Laboratory Technician | 2 | Married | 3468 | 16632 | 9 | Y | No | 12 | 3 | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
| 5 | 32 | No | Travel_Frequently | 1005 | Research & Development | 2 | 2 | Life Sciences | 1 | 8 | 4 | Male | 79 | 3 | 1 | Laboratory Technician | 4 | Single | 3068 | 11864 | 0 | Y | No | 13 | 3 | 3 | 80 | 0 | 8 | 2 | 2 | 7 | 7 | 3 | 6 |
| 6 | 59 | No | Travel_Rarely | 1324 | Research & Development | 3 | 3 | Medical | 1 | 10 | 3 | Female | 81 | 4 | 1 | Laboratory Technician | 1 | Married | 2670 | 9964 | 4 | Y | Yes | 20 | 4 | 1 | 80 | 3 | 12 | 3 | 2 | 1 | 0 | 0 | 0 |
| 7 | 30 | No | Travel_Rarely | 1358 | Research & Development | 24 | 1 | Life Sciences | 1 | 11 | 4 | Male | 67 | 3 | 1 | Laboratory Technician | 3 | Divorced | 2693 | 13335 | 1 | Y | No | 22 | 4 | 2 | 80 | 1 | 1 | 2 | 3 | 1 | 0 | 0 | 0 |
| 8 | 38 | No | Travel_Frequently | 216 | Research & Development | 23 | 3 | Life Sciences | 1 | 12 | 4 | Male | 44 | 2 | 3 | Manufacturing Director | 3 | Single | 9526 | 8787 | 0 | Y | No | 21 | 4 | 2 | 80 | 0 | 10 | 2 | 3 | 9 | 7 | 1 | 8 |
| 9 | 36 | No | Travel_Rarely | 1299 | Research & Development | 27 | 3 | Medical | 1 | 13 | 3 | Male | 94 | 3 | 2 | Healthcare Representative | 3 | Married | 5237 | 16577 | 6 | Y | No | 13 | 3 | 2 | 80 | 2 | 17 | 3 | 2 | 7 | 7 | 7 | 7 |
# create a backup of original dataset for future use
df_raw = df.copy()
Understanding The Dataframe
# Let's see number of columns, column labels, column data types, memory usage, range index, and non-null information about DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Age                       1470 non-null   int64
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64
 6   Education                 1470 non-null   int64
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64
 9   EmployeeNumber            1470 non-null   int64
 10  EnvironmentSatisfaction   1470 non-null   int64
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64
 13  JobInvolvement            1470 non-null   int64
 14  JobLevel                  1470 non-null   int64
 15  JobRole                   1470 non-null   object
 16  JobSatisfaction           1470 non-null   int64
 17  MaritalStatus             1470 non-null   object
 18  MonthlyIncome             1470 non-null   int64
 19  MonthlyRate               1470 non-null   int64
 20  NumCompaniesWorked        1470 non-null   int64
 21  Over18                    1470 non-null   object
 22  OverTime                  1470 non-null   object
 23  PercentSalaryHike         1470 non-null   int64
 24  PerformanceRating         1470 non-null   int64
 25  RelationshipSatisfaction  1470 non-null   int64
 26  StandardHours             1470 non-null   int64
 27  StockOptionLevel          1470 non-null   int64
 28  TotalWorkingYears         1470 non-null   int64
 29  TrainingTimesLastYear     1470 non-null   int64
 30  WorkLifeBalance           1470 non-null   int64
 31  YearsAtCompany            1470 non-null   int64
 32  YearsInCurrentRole        1470 non-null   int64
 33  YearsSinceLastPromotion   1470 non-null   int64
 34  YearsWithCurrManager      1470 non-null   int64
dtypes: int64(26), object(9)
memory usage: 402.1+ KB
From the information above, we can see that:
- The Attrition column is the classification target, with data type object
- The columns are a mix of int64 (26 columns) and object (9 columns) data types
# Qualitative - Nominal (N)
qual_n = [
'Attrition', # target
'BusinessTravel',
'Department',
'EducationField',
'EmployeeNumber', # id
'Gender',
'JobRole',
'MaritalStatus',
'Over18',
'OverTime',
]
# Qualitative - Ordinal (O)
qual_o = [
'Education', # desc
'EnvironmentSatisfaction', # desc
'JobInvolvement', # desc
'JobLevel',
'JobSatisfaction', # desc
'PerformanceRating', # desc
'RelationshipSatisfaction', # desc
'StockOptionLevel',
'WorkLifeBalance', # desc
]
#Combined Qualitative Data Set (Nominal + Ordinal)
qual = qual_n + qual_o
df_qual = df[qual]
df_qual.head(10)
| Attrition | BusinessTravel | Department | EducationField | EmployeeNumber | Gender | JobRole | MaritalStatus | Over18 | OverTime | Education | EnvironmentSatisfaction | JobInvolvement | JobLevel | JobSatisfaction | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | WorkLifeBalance | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Yes | Travel_Rarely | Sales | Life Sciences | 1 | Female | Sales Executive | Single | Y | Yes | 2 | 2 | 3 | 2 | 4 | 3 | 1 | 0 | 1 |
| 1 | No | Travel_Frequently | Research & Development | Life Sciences | 2 | Male | Research Scientist | Married | Y | No | 1 | 3 | 2 | 2 | 2 | 4 | 4 | 1 | 3 |
| 2 | Yes | Travel_Rarely | Research & Development | Other | 4 | Male | Laboratory Technician | Single | Y | Yes | 2 | 4 | 2 | 1 | 3 | 3 | 2 | 0 | 3 |
| 3 | No | Travel_Frequently | Research & Development | Life Sciences | 5 | Female | Research Scientist | Married | Y | Yes | 4 | 4 | 3 | 1 | 3 | 3 | 3 | 0 | 3 |
| 4 | No | Travel_Rarely | Research & Development | Medical | 7 | Male | Laboratory Technician | Married | Y | No | 1 | 1 | 3 | 1 | 2 | 3 | 4 | 1 | 3 |
| 5 | No | Travel_Frequently | Research & Development | Life Sciences | 8 | Male | Laboratory Technician | Single | Y | No | 2 | 4 | 3 | 1 | 4 | 3 | 3 | 0 | 2 |
| 6 | No | Travel_Rarely | Research & Development | Medical | 10 | Female | Laboratory Technician | Married | Y | Yes | 3 | 3 | 4 | 1 | 1 | 4 | 1 | 3 | 2 |
| 7 | No | Travel_Rarely | Research & Development | Life Sciences | 11 | Male | Laboratory Technician | Divorced | Y | No | 1 | 4 | 3 | 1 | 3 | 4 | 2 | 1 | 3 |
| 8 | No | Travel_Frequently | Research & Development | Life Sciences | 12 | Male | Manufacturing Director | Single | Y | No | 3 | 4 | 2 | 3 | 3 | 4 | 2 | 0 | 3 |
| 9 | No | Travel_Rarely | Research & Development | Medical | 13 | Male | Healthcare Representative | Married | Y | No | 3 | 3 | 3 | 2 | 3 | 3 | 2 | 2 | 2 |
# Let's check the data types for identified Nominal Data
df[qual_n].dtypes
Attrition         object
BusinessTravel    object
Department        object
EducationField    object
EmployeeNumber     int64
Gender            object
JobRole           object
MaritalStatus     object
Over18            object
OverTime          object
dtype: object
# Let's check the data types for identified Ordinal Data (if there is need to decode the columns)
df[qual_o].dtypes
Education                   int64
EnvironmentSatisfaction     int64
JobInvolvement              int64
JobLevel                    int64
JobSatisfaction             int64
PerformanceRating           int64
RelationshipSatisfaction    int64
StockOptionLevel            int64
WorkLifeBalance             int64
dtype: object
df[qual].dtypes
Attrition                   object
BusinessTravel              object
Department                  object
EducationField              object
EmployeeNumber               int64
Gender                      object
JobRole                     object
MaritalStatus               object
Over18                      object
OverTime                    object
Education                    int64
EnvironmentSatisfaction      int64
JobInvolvement               int64
JobLevel                     int64
JobSatisfaction              int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StockOptionLevel             int64
WorkLifeBalance              int64
dtype: object
# Quantitative - Interval (I)
quant_i = [
'Age',
'DistanceFromHome',
'EmployeeCount', # 1
'NumCompaniesWorked',
'StandardHours', # 1
'TrainingTimesLastYear'
]
# Quantitative Ratio (R)
quant_r = [
'DailyRate',
'HourlyRate',
'MonthlyRate',
'PercentSalaryHike', # percentage
'TotalWorkingYears',
'MonthlyIncome',
'YearsAtCompany',
'YearsInCurrentRole',
'YearsSinceLastPromotion',
'YearsWithCurrManager'
]
quant = quant_i + quant_r
df_quant = df[quant]
df_quant.head(3)
| Age | DistanceFromHome | EmployeeCount | NumCompaniesWorked | StandardHours | TrainingTimesLastYear | DailyRate | HourlyRate | MonthlyRate | PercentSalaryHike | TotalWorkingYears | MonthlyIncome | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | 1 | 1 | 8 | 80 | 0 | 1102 | 94 | 19479 | 11 | 8 | 5993 | 6 | 4 | 0 | 5 |
| 1 | 49 | 8 | 1 | 1 | 80 | 3 | 279 | 61 | 24907 | 23 | 10 | 5130 | 10 | 7 | 1 | 7 |
| 2 | 37 | 2 | 1 | 6 | 80 | 3 | 1373 | 92 | 2396 | 15 | 7 | 2090 | 0 | 0 | 0 | 0 |
# copying the raw dataset
df_clean = df_raw.copy()
- df_raw = the raw dataset
- df_clean = the dataset after cleansing (dropped columns, dropped rows, encoding, etc.)
Note: Use df_clean for any cleansing steps so that the raw data retains its original values.
Note:
Check the number of unique values in each column to identify constant columns and identifier columns (e.g., employee number).
df_clean.nunique()
Age                           43
Attrition                      2
BusinessTravel                 3
DailyRate                    886
Department                     3
DistanceFromHome              29
Education                      5
EducationField                 6
EmployeeCount                  1
EmployeeNumber              1470
EnvironmentSatisfaction        4
Gender                         2
HourlyRate                    71
JobInvolvement                 4
JobLevel                       5
JobRole                        9
JobSatisfaction                4
MaritalStatus                  3
MonthlyIncome               1349
MonthlyRate                 1427
NumCompaniesWorked            10
Over18                         1
OverTime                       2
PercentSalaryHike             15
PerformanceRating              2
RelationshipSatisfaction       4
StandardHours                  1
StockOptionLevel               4
TotalWorkingYears             40
TrainingTimesLastYear          7
WorkLifeBalance                4
YearsAtCompany                37
YearsInCurrentRole            19
YearsSinceLastPromotion       16
YearsWithCurrManager          18
dtype: int64
Observation:
- EmployeeCount, Over18, and StandardHours each contain only 1 unique value, so they carry no information.
- EmployeeNumber has a unique value in every row, so it is an identifier rather than a feature.
bef = len(df_clean.columns) # number of columns before removal
df_clean = df_clean.drop(['EmployeeCount', 'Over18', 'StandardHours', 'EmployeeNumber'], axis = 1) # columns dropped
aft = len(df_clean.columns) # number of columns after removal
print(f"The number of columns is reduced from {bef} to {aft}.")
The number of columns is reduced from 35 to 31.
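Columns like these can also be detected programmatically instead of by eye. A minimal sketch on a toy frame (the column names mirror the dataset, the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'EmployeeCount': [1, 1, 1],    # constant column
    'Over18': ['Y', 'Y', 'Y'],     # constant column
    'Age': [41, 49, 37],           # varies, so it is kept
})

# Columns with a single unique value carry no information for modelling
constant_cols = df.columns[df.nunique() <= 1].tolist()
print(constant_cols)  # ['EmployeeCount', 'Over18']
```

The same expression applied to df_clean would flag EmployeeCount, Over18, and StandardHours.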
Observation:
df_clean.isnull().sum()
Age 0 Attrition 0 BusinessTravel 0 DailyRate 0 Department 0 DistanceFromHome 0 Education 0 EducationField 0 EnvironmentSatisfaction 0 Gender 0 HourlyRate 0 JobInvolvement 0 JobLevel 0 JobRole 0 JobSatisfaction 0 MaritalStatus 0 MonthlyIncome 0 MonthlyRate 0 NumCompaniesWorked 0 OverTime 0 PercentSalaryHike 0 PerformanceRating 0 RelationshipSatisfaction 0 StockOptionLevel 0 TotalWorkingYears 0 TrainingTimesLastYear 0 WorkLifeBalance 0 YearsAtCompany 0 YearsInCurrentRole 0 YearsSinceLastPromotion 0 YearsWithCurrManager 0 dtype: int64
Observation:
No column contains missing values, so no imputation is needed.
df_clean.duplicated().sum()
0
Observation:
There are no duplicate rows in the dataset.
df_quant.describe()
| Age | DistanceFromHome | EmployeeCount | NumCompaniesWorked | StandardHours | TrainingTimesLastYear | DailyRate | HourlyRate | MonthlyRate | PercentSalaryHike | TotalWorkingYears | MonthlyIncome | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1470.000000 | 1470.000000 | 1470.0 | 1470.000000 | 1470.0 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 |
| mean | 36.923810 | 9.192517 | 1.0 | 2.693197 | 80.0 | 2.799320 | 802.485714 | 65.891156 | 14313.103401 | 15.209524 | 11.279592 | 6502.931293 | 7.008163 | 4.229252 | 2.187755 | 4.123129 |
| std | 9.135373 | 8.106864 | 0.0 | 2.498009 | 0.0 | 1.289271 | 403.509100 | 20.329428 | 7117.786044 | 3.659938 | 7.780782 | 4707.956783 | 6.126525 | 3.623137 | 3.222430 | 3.568136 |
| min | 18.000000 | 1.000000 | 1.0 | 0.000000 | 80.0 | 0.000000 | 102.000000 | 30.000000 | 2094.000000 | 11.000000 | 0.000000 | 1009.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 30.000000 | 2.000000 | 1.0 | 1.000000 | 80.0 | 2.000000 | 465.000000 | 48.000000 | 8047.000000 | 12.000000 | 6.000000 | 2911.000000 | 3.000000 | 2.000000 | 0.000000 | 2.000000 |
| 50% | 36.000000 | 7.000000 | 1.0 | 2.000000 | 80.0 | 3.000000 | 802.000000 | 66.000000 | 14235.500000 | 14.000000 | 10.000000 | 4919.000000 | 5.000000 | 3.000000 | 1.000000 | 3.000000 |
| 75% | 43.000000 | 14.000000 | 1.0 | 4.000000 | 80.0 | 3.000000 | 1157.000000 | 83.750000 | 20461.500000 | 18.000000 | 15.000000 | 8379.000000 | 9.000000 | 7.000000 | 3.000000 | 7.000000 |
| max | 60.000000 | 29.000000 | 1.0 | 9.000000 | 80.0 | 6.000000 | 1499.000000 | 100.000000 | 26999.000000 | 25.000000 | 40.000000 | 19999.000000 | 40.000000 | 18.000000 | 15.000000 | 17.000000 |
df_quant.shape[0]
1470
plt.figure(figsize = (15, 8))
for i in range(0, len(quant)):
    plt.subplot(2, 8, i+1)
    sns.boxplot(y = df[quant[i]], color = '#005495', orient = 'v')
    plt.title(quant[i], fontweight = 'bold')
plt.tight_layout(pad = 2.0)
plt.show()
From the box plot above, we can see outliers in the following columns: TotalWorkingYears, TrainingTimesLastYear, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager, and MonthlyIncome.
# Considering the above chart, the columns from which outliers need to be removed are:
cols = ['TrainingTimesLastYear','TotalWorkingYears','MonthlyIncome','YearsAtCompany','YearsInCurrentRole','YearsSinceLastPromotion','YearsWithCurrManager']
# Before
plt.figure(figsize = (14, 4))
for i in range(0, len(cols)):
    plt.subplot(1, 7, i+1)
    sns.boxplot(y = df_clean[cols[i]], color = '#005495', orient = 'v')
plt.suptitle('Outliers Before Cleansing', y = 1.05, fontsize = 12)
plt.tight_layout(pad = 1.5)
plt.show()
print(f'Number of rows before removing outliers: {len(df_clean)}')
filtered_entries = np.array([True] * len(df_clean))
for col in cols:
    zscore = abs(stats.zscore(df_clean[col]))  # absolute z-score of each value
    filtered_entries = (zscore < 3) & filtered_entries  # keep rows with |z| < 3 in every column
df_ZSCORE = df_clean[filtered_entries]  # rows whose z-score stays below 3 for all listed columns
print(f'Number of rows after removing outliers: {len(df_ZSCORE)}')
Number of rows before removing outliers: 1470
Number of rows after removing outliers: 1387
# After using Z-SCORE
plt.figure(figsize = (14, 4))
for i in range(0, len(cols)):
    plt.subplot(1, 7, i+1)
    sns.boxplot(y = df_ZSCORE[cols[i]], color = '#005495', orient = 'v')
plt.suptitle('Outliers After Cleansing by Z-Score', y = 1.05, fontsize = 12)
plt.tight_layout(pad = 1.5)
plt.show()
print(f'Number of rows before removing outliers: {len(df_clean)}')
filtered_entries = np.array([True] * len(df_clean))
for col in cols:
    Q1 = df_clean[col].quantile(0.25)
    Q3 = df_clean[col].quantile(0.75)
    IQR = Q3 - Q1
    low_limit = Q1 - (IQR * 1.5)
    high_limit = Q3 + (IQR * 1.5)
    filtered_entries = ((df_clean[col] >= low_limit) & (df_clean[col] <= high_limit)) & filtered_entries
df_IQR = df_clean[filtered_entries]
print(f'Number of rows after removing outliers: {len(df_IQR)}')
Number of rows before removing outliers: 1470
Number of rows after removing outliers: 1024
# After using IQR
plt.figure(figsize = (14, 4))
for i in range(0, len(cols)):
    plt.subplot(1, 7, i+1)
    sns.boxplot(y = df_IQR[cols[i]], color = '#005495', orient = 'v')
plt.suptitle('Outliers After Cleansing by IQR', y = 1.05, fontsize = 12)
plt.tight_layout(pad = 1.5)
plt.show()
Z-Score: 1470 → 1387 rows (83 rows removed)
IQR: 1470 → 1024 rows (446 rows removed)
Observation:
The IQR method discards far more rows (about 30% of the data), so we proceed with the Z-score result to preserve sample size.
df_clean = df_ZSCORE.copy()
df_clean.head()
| Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | MonthlyRate | NumCompaniesWorked | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 2 | Female | 94 | 3 | 2 | Sales Executive | 4 | Single | 5993 | 19479 | 8 | Yes | 11 | 3 | 1 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 3 | Male | 61 | 2 | 2 | Research Scientist | 2 | Married | 5130 | 24907 | 1 | No | 23 | 4 | 4 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 4 | Male | 92 | 2 | 1 | Laboratory Technician | 3 | Single | 2090 | 2396 | 6 | Yes | 15 | 3 | 2 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 4 | Female | 56 | 3 | 1 | Research Scientist | 3 | Married | 2909 | 23159 | 1 | Yes | 11 | 3 | 3 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | Male | 40 | 3 | 1 | Laboratory Technician | 2 | Married | 3468 | 16632 | 9 | No | 12 | 3 | 4 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
Calculate Attrition Rate for ABC Technologies
How to Calculate Employee Turnover Rate?
To calculate the employee turnover rate, you need to know, for a given time period, the number of employee separations and the average number of employees. The average number of employees can be determined by adding the headcount on the first day and the last day of the period, then dividing that sum by two.
Then, plug the above values into the following turnover rate formula:
Turnover rate = [(# of employee separations) / (average # of employees)] x 100
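The formula can be wrapped in a small helper for reuse. A sketch (the function name is hypothetical; the example figures are the notebook's reported separations and headcounts):

```python
def turnover_rate(separations: int, start_headcount: int, end_headcount: int) -> float:
    """Turnover rate = separations / average headcount * 100."""
    avg = (start_headcount + end_headcount) / 2
    return separations / avg * 100

# 237 separations, headcount growing from 1426 to 1470 over the period
print(round(turnover_rate(237, 1426, 1470), 2))  # → 16.37
```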
df_attrition = df[['Attrition', 'JobLevel', 'MonthlyIncome', 'YearsAtCompany']].copy()
df_attrition.head()
| Attrition | JobLevel | MonthlyIncome | YearsAtCompany | |
|---|---|---|---|---|
| 0 | Yes | 2 | 5993 | 6 |
| 1 | No | 2 | 5130 | 10 |
| 2 | Yes | 1 | 2090 | 0 |
| 3 | No | 1 | 2909 | 8 |
| 4 | No | 1 | 3468 | 2 |
Add a column to identify if the employee joined this year
# Is the employee new this year?
ls_jointhisyear = []
for idx, row in df_attrition.iterrows():
    if row['YearsAtCompany'] < 1:
        ls_jointhisyear.append(1)  # worked for < 1 year
    else:
        ls_jointhisyear.append(0)  # worked for >= 1 year
df_attrition['JoinThisYear'] = ls_jointhisyear
df_attrition.head()
| Attrition | JobLevel | MonthlyIncome | YearsAtCompany | JoinThisYear | |
|---|---|---|---|---|---|
| 0 | Yes | 2 | 5993 | 6 | 0 |
| 1 | No | 2 | 5130 | 10 | 0 |
| 2 | Yes | 1 | 2090 | 0 | 1 |
| 3 | No | 1 | 2909 | 8 | 0 |
| 4 | No | 1 | 3468 | 2 | 0 |
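The row-by-row iterrows loop above can be replaced by a single vectorized np.where, which is both shorter and much faster. A sketch on a toy frame using the same YearsAtCompany values as the head() output:

```python
import numpy as np
import pandas as pd

df_attrition = pd.DataFrame({'YearsAtCompany': [6, 10, 0, 8, 0]})
# 1 if the employee has been at the company for less than a year, else 0
df_attrition['JoinThisYear'] = np.where(df_attrition['YearsAtCompany'] < 1, 1, 0)
print(df_attrition['JoinThisYear'].tolist())  # [0, 0, 1, 0, 1]
```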
Calculate the Attrition Rate
employee_start = df_attrition.loc[df_attrition['JoinThisYear'] == 0].shape[0] # the beginning of 2020
employee_end = df_attrition.shape[0] # the end of 2020
employee_avg = (employee_start + employee_end) / 2 # average of employees number in 2020
attrition_total = df_attrition.loc[df_attrition['Attrition'] == 'Yes'].shape[0] # total attrition
turnover_rate = attrition_total / employee_avg # turnover rate
print("Employee Attrition in 2020")
print("--------------------------------------------")
# attrition number in 2020
print(f"Average of Employees Number = {employee_avg:.0f} employees")
print(f"Total Attrition Number = {attrition_total} employees")
# specified for each job level
for i in sorted(df_attrition['JobLevel'].unique()):
    attrition_level = df_attrition.loc[(df_attrition['Attrition'] == 'Yes') & (df_attrition['JobLevel'] == i)].shape[0]
    print(f" - Job Level {i} = {attrition_level} employees")
print("--------------------------------------------")
# turnover rate in 2020
print(f"Attrition Rate = {turnover_rate:.2%}")
Employee Attrition in 2020
--------------------------------------------
Average of Employees Number = 1448 employees
Total Attrition Number = 237 employees
 - Job Level 1 = 143 employees
 - Job Level 2 = 52 employees
 - Job Level 3 = 32 employees
 - Job Level 4 = 5 employees
 - Job Level 5 = 5 employees
--------------------------------------------
Attrition Rate = 16.37%
plt.figure(figsize = (18, 14))
for i in range(0, len(quant)):
    plt.subplot(4, 4, i+1)
    sns.histplot(df[quant[i]], color='navy', kde=True, stat="density")  # histogram with KDE overlay
    plt.title(quant[i], fontsize = 14)
    plt.xlabel('')
    plt.ylabel('')
plt.tight_layout()
plt.show()
plt.figure(figsize = (18, 12))
for i in range(0, len(qual)):
    plt.subplot(4, 5, i+1)
    sns.countplot(x = qual[i], data = df_qual, color = '#005495', orient = 'v')
    plt.title(qual[i], fontsize = 14)
    plt.xlabel('')
    plt.ylabel('')
plt.tight_layout()
plt.show()
Observations:
The attrition rate of ABC Technologies is 16.37%
From the QUANTITATIVE attributes Descriptive Analytics, we can observe that:
- The MonthlyIncome, PercentSalaryHike, and YearsAtCompany columns are slightly skewed to the right (positively skewed)
- Apart from those columns, the distributions look approximately normal
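The right-skew noted above can be verified numerically with pandas' .skew(), where a positive value indicates a right tail. A sketch on synthetic, income-like data (log-normal draws, not the actual dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Log-normal draws mimic a right-skewed, income-like distribution
income = pd.Series(rng.lognormal(mean=8.5, sigma=0.5, size=1470))
print(income.skew() > 0)  # positive skew confirms the right tail
```

Applying df[quant].skew() to the real frame gives the same check for every quantitative column at once.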
From the QUALITATIVE attributes Descriptive Analytics, we can observe that:
quant_ord = quant + qual_o # quantitative + qualitative ordinal
plt.figure(figsize = (15, 15))
sns.heatmap(df[quant_ord].corr(), cmap = 'YlGnBu', annot = True, annot_kws={"size": 8}, fmt = '.2f')
plt.show()
Observations:
PercentSalaryHike and PerformanceRating have a fairly strong positive relationship
TotalWorkingYears has a fairly strong positive relationship with Age, MonthlyIncome, and JobLevel
YearsAtCompany has a fairly strong positive relationship with YearsInCurrentRole and YearsWithCurrManager
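Pairs like these can also be ranked programmatically instead of read off the heatmap. A sketch on a toy frame (the columns echo the dataset, the values are synthetic):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
age = rng.integers(18, 60, 200)
df = pd.DataFrame({
    'Age': age,
    'TotalWorkingYears': age - 18 + rng.integers(0, 5, 200),  # strongly tied to Age
    'HourlyRate': rng.integers(30, 100, 200),                 # unrelated noise
})
corr = df.corr().abs()
# Keep only the upper triangle so each pair appears once, then rank by strength
pairs = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)).stack()
print(pairs.sort_values(ascending=False).head(1))
```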
# check the number of quantitative columns
print(f"The amount of quantitative data = {len(quant)}")
The amount of quantitative data = 16
plt.figure(figsize = (18, 14))
for i in range(0, len(quant)):
    plt.subplot(4, 4, i+1)
    sns.kdeplot(x = quant[i], hue = 'Attrition', data = df, palette = 'colorblind', linewidth = 1, fill = True, legend = True)
    plt.axvline(df[quant[i]].loc[df['Attrition'] == 'No'].median(), color = '#5088bd', linewidth = 0.8, linestyle = '-.', label = 'No')
    plt.axvline(df[quant[i]].loc[df['Attrition'] == 'Yes'].median(), color = '#005495', linewidth = 0.8, linestyle = '-.', label = 'Yes')
    plt.title(quant[i], fontsize = 12)
    plt.xlabel('')
    plt.ylabel('')
    plt.legend(['No (Median)', 'Yes (Median)', 'Attrition: No', 'Attrition: Yes'], ncol = 2, loc = 'upper center')
plt.tight_layout(pad = 2.0)
plt.show()
# check the amount of qualitative data
print(f"The amount of qualitative data = {len(qual)-1}") # except attrition
The amount of qualitative data = 18
plt.figure(figsize = (20, 30))
k = 0
for i in range(1, len(qual)):
    # attrition count
    plt.subplot(9, 4, i+k)
    sns.countplot(x = qual[i], hue = 'Attrition', data = df_qual, palette = 'colorblind', orient = 'v')
    plt.title(f"{qual[i]}: Attrition Num", fontsize = 14)
    plt.xlabel('')
    plt.ylabel('')
    # attrition rate
    df_temp = df_qual.groupby([qual[i], 'Attrition']).size().reset_index().rename({0 : 'Size'}, axis = 1)
    df_temp['TotalSize'] = df_temp.groupby([qual[i]])['Size'].transform('sum')
    df_temp['AttritionRate'] = df_temp['Size'] / df_temp['TotalSize']
    df_temp = df_temp.loc[df_temp['Attrition'] == 'Yes'].reset_index(drop = True)
    df_temp = df_temp.sort_values(by = 'AttritionRate', ascending = False)
    plt.subplot(9, 4, i+k+1)
    sns.barplot(y = qual[i], x = 'AttritionRate', data = df_temp, palette = 'colorblind', orient = 'h')
    plt.title(f"{qual[i]}: Attrition Rate", fontsize = 14)
    plt.xlabel('')
    plt.ylabel('')
    plt.xlim(0.0, 0.5)
    k += 1
plt.tight_layout(w_pad = 0.5)
plt.show()
Attributes: JobLevel, PerformanceRating, TotalWorkingYears, TrainingTimesLastYear, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion
Observations:
Employees tend to leave after only a few years at the company (0-5 years in YearsAtCompany), compared to those who have worked for more than 10 years and remain at the company.
Attributes: DailyRate, HourlyRate, MonthlyIncome, MonthlyRate, PercentSalaryHike, StockOptionLevel
Observations:
Attributes: Age, Education, EducationField, Gender, MaritalStatus, NumCompaniesWorked, Over18
Observations:
Attributes: JobInvolvement, JobRole, JobSatisfaction, RelationshipSatisfaction
Observations:
Attribute: YearsWithCurrManager
Observations:
Attributes: BusinessTravel, Department, DistanceFromHome, EmployeeCount, EnvironmentSatisfaction, OverTime, StandardHours, WorkLifeBalance
Observations:
Employees with low environment satisfaction (EnvironmentSatisfaction) have a fairly high attrition rate.
# Add "Group Age" Feature
df_clean['GroupAge'] = np.where(df_clean['Age'] <= 30, '18-30',
np.where((df_clean['Age'] > 30) & (df_clean['Age'] <= 40), '31-40',
np.where((df_clean['Age'] > 40) & (df_clean['Age'] <= 50), '41-50', '51-60')))
df_clean[['Age', 'GroupAge']]
| Age | GroupAge | |
|---|---|---|
| 0 | 41 | 41-50 |
| 1 | 49 | 41-50 |
| 2 | 37 | 31-40 |
| 3 | 33 | 31-40 |
| 4 | 27 | 18-30 |
| ... | ... | ... |
| 1465 | 36 | 31-40 |
| 1466 | 39 | 31-40 |
| 1467 | 27 | 18-30 |
| 1468 | 49 | 41-50 |
| 1469 | 34 | 31-40 |
1387 rows × 2 columns
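The nested np.where calls above can be expressed more directly with pd.cut, using the same bin edges. A sketch on a handful of sample ages (not the actual column):

```python
import pandas as pd

ages = pd.Series([41, 49, 37, 33, 27, 60, 18])
# Bins are right-inclusive: (17, 30], (30, 40], (40, 50], (50, 60]
group_age = pd.cut(
    ages,
    bins=[17, 30, 40, 50, 60],
    labels=['18-30', '31-40', '41-50', '51-60'],
)
print(group_age.tolist())  # ['41-50', '41-50', '31-40', '31-40', '18-30', '51-60', '18-30']
```

pd.cut also scales to more bins without deepening the nesting, which keeps the feature definition readable.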
# Median Monthly Income By Job Level
MedIncome = df_clean.groupby(['JobLevel'])['MonthlyIncome'].median()
MedIncome
JobLevel
1     2661.5
2     5343.0
3     9852.0
4    15992.0
5    19174.0
Name: MonthlyIncome, dtype: float64
# Add Feature Median Monthly Income By Job Level (below == 1, above == 0)
df_clean['BelowMedIncome'] = df_clean['JobLevel'].apply(lambda x: MedIncome[x])
df_clean['BelowMedIncome'] = np.where(df_clean['MonthlyIncome'] < df_clean['BelowMedIncome'], 1, 0)
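The two-step lookup above (map each job level's median, then compare) can also be written in one pass with groupby().transform('median'), which aligns each row with the median of its own group. A sketch with made-up figures:

```python
import pandas as pd

df = pd.DataFrame({
    'JobLevel':      [1, 1, 1, 2, 2],
    'MonthlyIncome': [2000, 2661, 3000, 5000, 6000],
})
# Flag incomes strictly below the median of the employee's own job level
med = df.groupby('JobLevel')['MonthlyIncome'].transform('median')
df['BelowMedIncome'] = (df['MonthlyIncome'] < med).astype(int)
print(df['BelowMedIncome'].tolist())  # [1, 0, 0, 1, 0]
```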
# added group age (18-30) & overtime (yes) features
df_clean['GroupAge_Overtime'] = np.where((df_clean['GroupAge'] == '18-30') & (df_clean['OverTime'] == 'Yes'), 1, 0)
# added job level (1) & overtime (yes) features
df_clean['JobLevel_Overtime'] = np.where((df_clean['JobLevel'] == 1) & (df_clean['OverTime'] == 'Yes'), 1, 0)
# added job level (1) & below median monthly income (1) & overtime (yes)
df_clean['JobLevel_BelowMedIncome_Overtime'] = np.where((df_clean['JobLevel'] == 1) & (df_clean['BelowMedIncome'] == 1) & (df_clean['OverTime'] == 'Yes'), 1, 0)
df_clean
| Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | MonthlyRate | NumCompaniesWorked | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | GroupAge | BelowMedIncome | GroupAge_Overtime | JobLevel_Overtime | JobLevel_BelowMedIncome_Overtime | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 2 | Female | 94 | 3 | 2 | Sales Executive | 4 | Single | 5993 | 19479 | 8 | Yes | 11 | 3 | 1 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 | 41-50 | 0 | 0 | 0 | 0 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 3 | Male | 61 | 2 | 2 | Research Scientist | 2 | Married | 5130 | 24907 | 1 | No | 23 | 4 | 4 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 | 41-50 | 1 | 0 | 0 | 0 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 4 | Male | 92 | 2 | 1 | Laboratory Technician | 3 | Single | 2090 | 2396 | 6 | Yes | 15 | 3 | 2 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 | 31-40 | 1 | 0 | 1 | 1 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 4 | Female | 56 | 3 | 1 | Research Scientist | 3 | Married | 2909 | 23159 | 1 | Yes | 11 | 3 | 3 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 | 31-40 | 0 | 0 | 1 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | Male | 40 | 3 | 1 | Laboratory Technician | 2 | Married | 3468 | 16632 | 9 | No | 12 | 3 | 4 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 | 18-30 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1465 | 36 | No | Travel_Frequently | 884 | Research & Development | 23 | 2 | Medical | 3 | Male | 41 | 4 | 2 | Laboratory Technician | 4 | Married | 2571 | 12290 | 4 | No | 17 | 3 | 3 | 1 | 17 | 3 | 3 | 5 | 2 | 0 | 3 | 31-40 | 1 | 0 | 0 | 0 |
| 1466 | 39 | No | Travel_Rarely | 613 | Research & Development | 6 | 1 | Medical | 4 | Male | 42 | 2 | 3 | Healthcare Representative | 1 | Married | 9991 | 21457 | 4 | No | 15 | 3 | 1 | 1 | 9 | 5 | 3 | 7 | 7 | 1 | 7 | 31-40 | 0 | 0 | 0 | 0 |
| 1467 | 27 | No | Travel_Rarely | 155 | Research & Development | 4 | 3 | Life Sciences | 2 | Male | 87 | 4 | 2 | Manufacturing Director | 2 | Married | 6142 | 5174 | 1 | Yes | 20 | 4 | 2 | 1 | 6 | 0 | 3 | 6 | 2 | 0 | 3 | 18-30 | 0 | 1 | 0 | 0 |
| 1468 | 49 | No | Travel_Frequently | 1023 | Sales | 2 | 3 | Medical | 4 | Male | 63 | 2 | 2 | Sales Executive | 2 | Married | 5390 | 13243 | 2 | No | 14 | 3 | 4 | 0 | 17 | 3 | 2 | 9 | 6 | 0 | 8 | 41-50 | 0 | 0 | 0 | 0 |
| 1469 | 34 | No | Travel_Rarely | 628 | Research & Development | 8 | 3 | Medical | 2 | Male | 82 | 4 | 2 | Laboratory Technician | 3 | Married | 4404 | 10228 | 2 | No | 12 | 3 | 1 | 0 | 6 | 3 | 4 | 4 | 3 | 1 | 2 | 31-40 | 1 | 0 | 0 | 0 |
1387 rows × 36 columns
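As a sanity check, the same flag logic can be exercised on a toy frame (the values below are made up, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) exercising the same flag logic as above
toy = pd.DataFrame({
    'JobLevel':      [1,    1,    2],
    'MonthlyIncome': [2000, 3000, 6000],
    'OverTime':      ['Yes', 'No', 'Yes'],
})
med = toy.groupby('JobLevel')['MonthlyIncome'].median()  # level 1 -> 2500, level 2 -> 6000
toy['BelowMedIncome'] = np.where(toy['MonthlyIncome'] < toy['JobLevel'].map(med), 1, 0)
toy['JobLevel_Overtime'] = np.where((toy['JobLevel'] == 1) & (toy['OverTime'] == 'Yes'), 1, 0)
print(toy[['BelowMedIncome', 'JobLevel_Overtime']].values.tolist())
```

Row 0 is a level-1 employee earning below the level median and working overtime, so both flags fire for it and neither flag fires for the level-2 employee.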
Standardize numeric features that are approximately normally distributed; otherwise, normalize them to the [0, 1] range.
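One way to decide between the two scalers is a quick skewness check. This sketch uses synthetic stand-ins for the columns and an assumed |skew| < 0.5 threshold, a heuristic rather than a hard rule:

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins: a roughly symmetric column and a right-skewed one
rng = np.random.default_rng(42)
age_like = rng.normal(35, 9, size=1000)
income_like = rng.exponential(5000, size=1000)

# Heuristic (an assumption): |skew| < 0.5 -> standardize, else normalize
for name, col in [('age_like', age_like), ('income_like', income_like)]:
    print(name, 'skew =', round(float(stats.skew(col)), 2))
```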
# save data for simulation
df_simulation = df_clean.copy()
df_simulation.to_csv(f"{gd_path}/dataset/data_valid_all_sim.csv", index = False)
# standardization scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler
std_cols = ['Age']
for col in std_cols:
    x = df_clean[col].values.reshape(-1, 1)
    scaler = StandardScaler()
    x_scaled = scaler.fit_transform(x)
    df_clean[col] = x_scaled
# normalization scaling
nrm_cols = ['DailyRate', 'DistanceFromHome', 'HourlyRate', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike', 'TotalWorkingYears', 'TrainingTimesLastYear', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']
for col in nrm_cols:
    x = df_clean[col].values.reshape(-1, 1)
    scaler = MinMaxScaler()
    x_scaled = scaler.fit_transform(x)
    df_clean[col] = x_scaled
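For reference, the two transformations reduce to simple formulas. This NumPy sketch mirrors what `StandardScaler` and `MinMaxScaler` compute on a single toy column:

```python
import numpy as np

age = np.array([18.0, 30.0, 42.0, 60.0])

# Standardization (what StandardScaler computes): zero mean, unit variance
z = (age - age.mean()) / age.std()

# Normalization (what MinMaxScaler computes): rescale into [0, 1]
m = (age - age.min()) / (age.max() - age.min())

print(round(float(z.mean()), 6), float(m.min()), float(m.max()))
```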
df_clean[std_cols + nrm_cols].head(10)
| | Age | DailyRate | DistanceFromHome | HourlyRate | MonthlyIncome | MonthlyRate | NumCompaniesWorked | PercentSalaryHike | TotalWorkingYears | TrainingTimesLastYear | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.536681 | 0.716332 | 0.000000 | 0.914286 | 0.263230 | 0.698016 | 0.888889 | 0.000000 | 0.235294 | 0.000000 | 0.24 | 0.266667 | 0.000000 | 0.357143 |
| 1 | 1.442111 | 0.126791 | 0.250000 | 0.442857 | 0.217651 | 0.915991 | 0.111111 | 0.857143 | 0.294118 | 0.500000 | 0.40 | 0.466667 | 0.090909 | 0.500000 |
| 2 | 0.083966 | 0.910458 | 0.035714 | 0.885714 | 0.057093 | 0.012007 | 0.666667 | 0.285714 | 0.205882 | 0.500000 | 0.00 | 0.000000 | 0.000000 | 0.000000 |
| 3 | -0.368749 | 0.924069 | 0.071429 | 0.371429 | 0.100349 | 0.845796 | 0.111111 | 0.000000 | 0.235294 | 0.500000 | 0.32 | 0.466667 | 0.272727 | 0.000000 |
| 4 | -1.047821 | 0.350287 | 0.035714 | 0.142857 | 0.129872 | 0.583688 | 1.000000 | 0.071429 | 0.176471 | 0.500000 | 0.08 | 0.133333 | 0.181818 | 0.142857 |
| 5 | -0.481928 | 0.646848 | 0.035714 | 0.700000 | 0.108746 | 0.392217 | 0.000000 | 0.142857 | 0.235294 | 0.333333 | 0.28 | 0.466667 | 0.272727 | 0.428571 |
| 6 | 2.573899 | 0.875358 | 0.071429 | 0.728571 | 0.087726 | 0.315918 | 0.444444 | 0.642857 | 0.352941 | 0.500000 | 0.04 | 0.000000 | 0.000000 | 0.000000 |
| 7 | -0.708285 | 0.899713 | 0.821429 | 0.528571 | 0.088941 | 0.451289 | 0.111111 | 0.785714 | 0.029412 | 0.333333 | 0.04 | 0.000000 | 0.000000 | 0.000000 |
| 8 | 0.197145 | 0.081662 | 0.785714 | 0.200000 | 0.449826 | 0.268653 | 0.000000 | 0.714286 | 0.294118 | 0.333333 | 0.36 | 0.466667 | 0.090909 | 0.571429 |
| 9 | -0.029213 | 0.857450 | 0.928571 | 0.914286 | 0.223302 | 0.581479 | 0.666667 | 0.142857 | 0.500000 | 0.500000 | 0.28 | 0.466667 | 0.636364 | 0.500000 |
Decide which categorical features should be label-encoded (0, 1, etc.) and which should be one-hot encoded.
# traditional labeling
label_cols = ['Attrition', 'OverTime', 'Gender', 'PerformanceRating']
df_clean['Attrition'] = df_clean['Attrition'].map({'No' : 0, 'Yes' : 1})
df_clean['OverTime'] = df_clean['OverTime'].map({'No' : 0, 'Yes' : 1})
df_clean['Gender'] = df_clean['Gender'].map({'Female' : 0, 'Male' : 1})
df_clean['PerformanceRating'] = df_clean['PerformanceRating'].map({3 : 0, 4 : 1}) # only consist of 2 unique values
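One pitfall of `Series.map` worth keeping in mind: any category missing from the dictionary silently becomes NaN. A minimal illustration (the value 'Maybe' is hypothetical, not in the dataset):

```python
import pandas as pd

# 'Maybe' is a hypothetical value absent from the mapping
s = pd.Series(['No', 'Yes', 'Maybe'])
mapped = s.map({'No': 0, 'Yes': 1})
print(int(mapped.isna().sum()))  # unmapped categories become NaN -> 1
```

Checking `mapped.isna().sum() == 0` after each mapping is a cheap guard against typos in the category names.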
# one-hot encoding
onehot_cols = ['BusinessTravel', 'Department', 'EducationField', 'JobRole', 'MaritalStatus', 'Education', 'EnvironmentSatisfaction', 'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'RelationshipSatisfaction', 'StockOptionLevel', 'WorkLifeBalance', 'GroupAge']
result_cols = []
for col in onehot_cols:
    onehot = pd.get_dummies(df_clean[col], prefix = col)
    dummies_cols = list(onehot.columns)
    result_cols = result_cols + dummies_cols
    df_clean = df_clean.join(onehot)
df_clean = df_clean.drop(onehot_cols, axis = 1)
df_clean[label_cols + result_cols].head()
| | Attrition | OverTime | Gender | PerformanceRating | BusinessTravel_Non-Travel | BusinessTravel_Travel_Frequently | BusinessTravel_Travel_Rarely | Department_Human Resources | Department_Research & Development | Department_Sales | EducationField_Human Resources | EducationField_Life Sciences | EducationField_Marketing | EducationField_Medical | EducationField_Other | EducationField_Technical Degree | JobRole_Healthcare Representative | JobRole_Human Resources | JobRole_Laboratory Technician | JobRole_Manager | JobRole_Manufacturing Director | JobRole_Research Director | JobRole_Research Scientist | JobRole_Sales Executive | JobRole_Sales Representative | ... | JobLevel_1 | JobLevel_2 | JobLevel_3 | JobLevel_4 | JobLevel_5 | JobSatisfaction_1 | JobSatisfaction_2 | JobSatisfaction_3 | JobSatisfaction_4 | RelationshipSatisfaction_1 | RelationshipSatisfaction_2 | RelationshipSatisfaction_3 | RelationshipSatisfaction_4 | StockOptionLevel_0 | StockOptionLevel_1 | StockOptionLevel_2 | StockOptionLevel_3 | WorkLifeBalance_1 | WorkLifeBalance_2 | WorkLifeBalance_3 | WorkLifeBalance_4 | GroupAge_18-30 | GroupAge_31-40 | GroupAge_41-50 | GroupAge_51-60 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | False | False | True | False | False | True | False | True | False | False | False | False | False | False | False | False | False | False | False | True | False | ... | False | True | False | False | False | False | False | False | True | True | False | False | False | True | False | False | False | True | False | False | False | False | False | True | False |
| 1 | 0 | 0 | 1 | 1 | False | True | False | False | True | False | False | True | False | False | False | False | False | False | False | False | False | False | True | False | False | ... | False | True | False | False | False | False | True | False | False | False | False | False | True | False | True | False | False | False | False | True | False | False | False | True | False |
| 2 | 1 | 1 | 1 | 0 | False | False | True | False | True | False | False | False | False | False | True | False | False | False | True | False | False | False | False | False | False | ... | True | False | False | False | False | False | False | True | False | False | True | False | False | True | False | False | False | False | False | True | False | False | True | False | False |
| 3 | 0 | 1 | 0 | 0 | False | True | False | False | True | False | False | True | False | False | False | False | False | False | False | False | False | False | True | False | False | ... | True | False | False | False | False | False | False | True | False | False | False | True | False | True | False | False | False | False | False | True | False | False | True | False | False |
| 4 | 0 | 0 | 1 | 0 | False | False | True | False | True | False | False | False | False | True | False | False | False | False | True | False | False | False | False | False | False | ... | True | False | False | False | False | False | True | False | False | False | False | False | True | False | True | False | False | False | False | True | False | True | False | False | False |
5 rows × 66 columns
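The `pd.get_dummies` step can be seen in isolation on a single column: the dummies come out prefixed, one column per category (in alphabetical order), with exactly one True per row:

```python
import pandas as pd

travel = pd.Series(['Non-Travel', 'Travel_Rarely', 'Travel_Frequently'])
onehot = pd.get_dummies(travel, prefix='BusinessTravel')
print(list(onehot.columns))
```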
x = df_clean.drop(['Attrition'], axis = 1) # features
y = df_clean[['Attrition']] # target
# train = training data for modeling
# test = unseen data for validation
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.1, random_state = 42)
df_clean = pd.DataFrame(y_train).join(x_train)
df_unseen = pd.DataFrame(y_test).join(x_test)
print(df_clean.shape)
print(df_unseen.shape)
(1248, 84)
(139, 84)
Clean data [Attrition] attribute counts:
df_clean['Attrition'].value_counts()
Attrition
0    1051
1     197
Name: count, dtype: int64
Unseen data [Attrition] attribute counts:
df_unseen['Attrition'].value_counts()
Attrition
0    107
1     32
Name: count, dtype: int64
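The counts above imply an imbalanced target. A quick check of the minority-class share (note the split above was not stratified, so the share differs between the two sets; passing `stratify = y` to `train_test_split` would preserve it):

```python
# Minority-class share implied by the counts above
train_counts = {0: 1051, 1: 197}
test_counts = {0: 107, 1: 32}
train_ratio = train_counts[1] / sum(train_counts.values())
test_ratio = test_counts[1] / sum(test_counts.values())
print(round(train_ratio, 3), round(test_ratio, 3))
```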
Select the features needed for modeling based on EDA observations or other considerations such as domain experience.
df_selection = df_clean.copy()
x = df_selection.drop(['Attrition', 'GroupAge_18-30', 'GroupAge_31-40', 'GroupAge_41-50', 'GroupAge_51-60', 'BelowMedIncome', 'GroupAge_Overtime', 'JobLevel_Overtime', 'JobLevel_BelowMedIncome_Overtime'], axis = 1) # all features exclude new features
y = df_selection[['Attrition']] # target
# fitting the data
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier(random_state = 42)
model.fit(x, y.values.ravel()) # ravel to pass a 1-D target
result = model.feature_importances_
# convert to dataframe
feat_importances = pd.DataFrame({'Feature' : x.columns, 'Importance' : result})
feat_importances['Feature_Display'] = feat_importances['Feature'].apply(lambda x: " ".join(re.findall('[A-Z][^A-Z]*', x)))
feat_importances['Feature_Display'] = feat_importances['Feature_Display'].apply(lambda x: x.replace("_", ": "))
feat_importances
| | Feature | Importance | Feature_Display |
|---|---|---|---|
| 0 | Age | 0.028026 | Age |
| 1 | DailyRate | 0.018761 | Daily Rate |
| 2 | DistanceFromHome | 0.023421 | Distance From Home |
| 3 | Gender | 0.013839 | Gender |
| 4 | HourlyRate | 0.020784 | Hourly Rate |
| ... | ... | ... | ... |
| 70 | StockOptionLevel_3 | 0.007303 | Stock Option Level: 3 |
| 71 | WorkLifeBalance_1 | 0.010921 | Work Life Balance: 1 |
| 72 | WorkLifeBalance_2 | 0.015137 | Work Life Balance: 2 |
| 73 | WorkLifeBalance_3 | 0.014842 | Work Life Balance: 3 |
| 74 | WorkLifeBalance_4 | 0.008583 | Work Life Balance: 4 |
75 rows × 3 columns
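The CamelCase-to-display transformation applied to build `Feature_Display` can be verified on its own:

```python
import re

def display_name(feature):
    # Split CamelCase into words, then turn the one-hot suffix '_X' into ': X'
    words = " ".join(re.findall('[A-Z][^A-Z]*', feature))
    return words.replace("_", ": ")

print(display_name('DistanceFromHome'), '|', display_name('WorkLifeBalance_1'))
```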
# set plot size
plt.figure(figsize = (9, 6))
# plotting the feature importances
data = feat_importances.sort_values(['Importance'], ascending = False).iloc[:20] # top 20
sns.barplot(y = 'Feature_Display', x = 'Importance', data = data, palette = 'colorblind')
plt.title('Top 20 Key Factors of Employee Attrition', fontweight = 'bold', pad = 20)
plt.xlabel('') # clear label
plt.ylabel('') # clear label
plt.tight_layout()
plt.show()
Overtime stands out as by far the most important factor in employee attrition, well ahead of every other feature.
We will build a model to predict employee attrition using features with an importance score of at least 0.02,
plus the following manually added features:
- Department_Sales
- Department_Research & Development
- Department_Human Resources
- JobSatisfaction_1
- RelationshipSatisfaction_1
# filtering features
selected_features = list(feat_importances.loc[feat_importances['Importance'] >= 0.02]['Feature'])
add_features1 = ['Department_Sales', 'Department_Research & Development', 'Department_Human Resources', 'JobSatisfaction_1', 'RelationshipSatisfaction_1']
add_features2 = ['GroupAge_18-30', 'GroupAge_31-40', 'GroupAge_41-50', 'GroupAge_51-60', 'BelowMedIncome', 'GroupAge_Overtime', 'JobLevel_Overtime', 'JobLevel_BelowMedIncome_Overtime']
# clean and unseen data with all features
df_clean_all = df_clean.copy()
df_valid_all = df_unseen.copy()
# clean and unseen data with features of importance >= 0.02 plus the manually added features
df_clean_selected = df_clean[['Attrition']].join(df_clean[selected_features + add_features1])
df_valid_selected = df_unseen[['Attrition']].join(df_unseen[selected_features + add_features1])
# clean and unseen data with features of importance >= 0.02, the manually added features, and the newly engineered features (group age, etc.)
df_clean_selected_new = df_clean[['Attrition']].join(df_clean[selected_features + add_features1 + add_features2])
df_valid_selected_new = df_unseen[['Attrition']].join(df_unseen[selected_features + add_features1 + add_features2])
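The importance cut-off is a simple boolean filter; a toy sketch with hypothetical scores shows the pattern:

```python
import pandas as pd

# Hypothetical importance scores illustrating the >= 0.02 cut-off
feat = pd.DataFrame({
    'Feature': ['Age', 'Gender', 'OverTime', 'HourlyRate'],
    'Importance': [0.028, 0.014, 0.065, 0.021],
})
selected = list(feat.loc[feat['Importance'] >= 0.02, 'Feature'])
print(selected)
```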
# check categorical data type
df[qual].dtypes
Attrition                   object
BusinessTravel              object
Department                  object
EducationField              object
EmployeeNumber               int64
Gender                      object
JobRole                     object
MaritalStatus               object
Over18                      object
OverTime                    object
Education                    int64
EnvironmentSatisfaction      int64
JobInvolvement               int64
JobLevel                     int64
JobSatisfaction              int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StockOptionLevel             int64
WorkLifeBalance              int64
dtype: object
df['Department'].value_counts()
Department
Research & Development    961
Sales                     446
Human Resources            63
Name: count, dtype: int64
df_imp = df.copy()
df_imp['BusinessTravel'] = df_imp['BusinessTravel'].map({'Non-Travel' : 0, 'Travel_Rarely' : 1, 'Travel_Frequently' : 2})
df_imp['Gender'] = df_imp['Gender'].map({'Male' : 0, 'Female' : 1})
df_imp['OverTime'] = df_imp['OverTime'].map({'No' : 0, 'Yes' : 1})
df_imp['MaritalStatus'] = df_imp['MaritalStatus'].map({'Single' : 0, 'Married' : 1, 'Divorced' : 2})
df_imp['EducationField'] = df_imp['EducationField'].astype('category').cat.codes
df_imp['JobRole'] = df_imp['JobRole'].astype('category').cat.codes
df_imp = df_imp.drop(['Over18', 'EmployeeNumber'], axis = 1)
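`cat.codes` assigns integers in the alphabetical order of the categories, and the code-to-label mapping can be recovered from `cat.categories`, as this small sketch shows:

```python
import pandas as pd

field = pd.Series(['Medical', 'Life Sciences', 'Medical', 'Other'])
cat = field.astype('category')
codes = cat.cat.codes                          # one integer per row
mapping = dict(enumerate(cat.cat.categories))  # code -> original label
print(codes.tolist(), mapping)
```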
# check data type after encoding
df_imp.dtypes
Age                          int64
Attrition                   object
BusinessTravel               int64
DailyRate                    int64
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField                int8
EmployeeCount                int64
EnvironmentSatisfaction      int64
Gender                       int64
HourlyRate                   int64
JobInvolvement               int64
JobLevel                     int64
JobRole                       int8
JobSatisfaction              int64
MaritalStatus                int64
MonthlyIncome                int64
MonthlyRate                  int64
NumCompaniesWorked           int64
OverTime                     int64
PercentSalaryHike            int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StandardHours                int64
StockOptionLevel             int64
TotalWorkingYears            int64
TrainingTimesLastYear        int64
WorkLifeBalance              int64
YearsAtCompany               int64
YearsInCurrentRole           int64
YearsSinceLastPromotion      int64
YearsWithCurrManager         int64
dtype: object
df_imp['Department'].nunique()
3
dept_unique = df_imp['Department'].unique()
fig = plt.figure(figsize = (20, 6))
for i in range(len(dept_unique)):
    x = df_imp.loc[df_imp['Department'] == dept_unique[i]].drop(['Attrition', 'Department'], axis = 1) # features
    y = df_imp.loc[df_imp['Department'] == dept_unique[i]]['Attrition'] # target
    feat_importance = ExtraTreesClassifier(random_state = 42).fit(x, y).feature_importances_
    feat_imp = pd.DataFrame({'column' : x.columns, 'value' : feat_importance}).sort_values('value', ascending = False).reset_index(drop = True)
    feat_imp['column'] = feat_imp['column'].apply(lambda c: " ".join(re.findall('[A-Z][^A-Z]*', c)))
    plt.subplot(1, 3, i + 1)
    sns.barplot(y = 'column', x = 'value', data = feat_imp.iloc[0:10], palette = 'colorblind')
    plt.title(f"{dept_unique[i]} Department", fontweight = 'bold', fontsize = 14)
    plt.xlabel('')
    plt.ylabel('')
    plt.xlim(0.0, 0.095)
fig.suptitle("The Top 10 Key Factors of Attrition in Each Department", fontweight = 'bold', y = 1.1)
fig.text(x = 0, y = -0.05, s = "Overtime is a key factor of attrition in the Sales and R&D departments, but not in HR. Most employees in the HR department resigned because of the distance from their homes. \nIn addition, most of the HR employees who resigned were young (under 30 years old). Note that each department has its own reasons for attrition.", ha = 'left', va = 'center_baseline', color = '#555b6e', fontsize = 16)
fig.tight_layout(w_pad = 0.5)
plt.show()
# # save the cleaned data (ready to model)
# fname_all = 'dataset/data_clean_all.csv'
# fname_selected = 'dataset/data_clean_selected.csv'
# fname_selected_new = 'dataset/data_clean_selected_new.csv'
# fnames = [fname_all, fname_selected, fname_selected_new]
# df_export = [df_clean_all, df_clean_selected, df_clean_selected_new]
# for x_fname, y_df in zip(fnames, df_export):
#     fname = f"{gd_path}/{x_fname}"
#     y_df.to_csv(fname, index = False)
# # save unseen data
# fname_all = 'dataset/data_valid_all.csv'
# fname_selected = 'dataset/data_valid_selected.csv'
# fname_selected_new = 'dataset/data_valid_selected_new.csv'
# fnames = [fname_all, fname_selected, fname_selected_new]
# df_export = [df_valid_all, df_valid_selected, df_valid_selected_new]
# for x_fname, y_df in zip(fnames, df_export):
#     fname = f"{gd_path}/{x_fname}"
#     y_df.to_csv(fname, index = False)
Taking a Glance at Columns and Data Types
# Let's see the list of columns
df.dtypes
Age                          int64
Attrition                   object
BusinessTravel              object
DailyRate                    int64
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField              object
EmployeeCount                int64
EmployeeNumber               int64
EnvironmentSatisfaction      int64
Gender                      object
HourlyRate                   int64
JobInvolvement               int64
JobLevel                     int64
JobRole                     object
JobSatisfaction              int64
MaritalStatus               object
MonthlyIncome                int64
MonthlyRate                  int64
NumCompaniesWorked           int64
Over18                      object
OverTime                    object
PercentSalaryHike            int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StandardHours                int64
StockOptionLevel             int64
TotalWorkingYears            int64
TrainingTimesLastYear        int64
WorkLifeBalance              int64
YearsAtCompany               int64
YearsInCurrentRole           int64
YearsSinceLastPromotion      int64
YearsWithCurrManager         int64
dtype: object
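A quick way to collect the categorical candidates from such a listing is `select_dtypes`; the toy frame below is illustrative (the notebook's own `qual` list is assumed to have been built similarly):

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [41, 49],
    'Attrition': ['Yes', 'No'],
    'Department': ['Sales', 'Research & Development'],
})
qual = list(df.select_dtypes(include='object').columns)  # object-typed columns
print(qual)
```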